NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Scaling Laws for the Workload Throughput of Emerging Heterogeneous Clusters

https://doi.org/10.1109/CCGRID64434.2025.00025

Alasandagutti, Akhil; Suetterlein, Joshua; Firoz, Jesun; Young, Stephen; Manzano, Joseph; Stewart, Jason R; Bridges, Patrick G; Estrada, Trilce; Barker, Kevin (May 2025, IEEE)

Not AvailableNext-generation HPC clusters are evolving into highly heterogeneous systems that integrate traditional computing resources with emerging accelerator technologies such as quantum processors, neuromorphic units, dataflow architectures, and specialized AI accelerators within a unified infrastructure. These advanced systems enable workloads to dynamically utilize different accelerators during various computation phases, creating complex execution patterns. The performance of the workloads can therefore be impacted by many factors, including how the accelerators are shared, their utilization, and their placement within the system. Moreover, effects such as the system and network state due to the overall system load can significantly impact the job completion rate. Understanding, identifying, and quantifying the impact of the most critical factors (e.g., the number of allocated accelerators) will help decide the investment decisions for accelerator acquisition and deployment that can improve the overall system throughput. This paper extensively studies these complex interactions among advanced accelerators within an HPC cluster and various workloads. We introduce a novel analytical model which predicts the speedup of a workload given an accelerator/system configuration. This model can be used to quantify the effect of augmenting additional accelerators on job performance running on an HPC cluster. We validate the model using both simulated and real environments.
more » « less
Free, publicly-accessible full text available May 19, 2026
{MGG}: Accelerating Graph Neural Networks with {Fine-Grained} {Intra-Kernel} {Communication-Computation} Pipelining on {Multi-GPU} Platforms

Wang, Yuke; Feng, Boyuan; Wang, Zheng; Barker, Kevin; Li, Ang; Ding, Yufei. (July 2023, USENIX Association)

The increasing size of input graphs for graph neural networks (GNNs) highlights the demand for using multi-GPU platforms. However, existing multi-GPU GNN systems optimize the computation and communication individually based on the conventional practice of scaling dense DNNs. For irregularly sparse and fine-grained GNN workloads, such solutions miss the opportunity to jointly schedule/optimize the computation and communication operations for high-performance delivery. To this end, we propose MGG , a novel system design to accelerate full-graph GNNs on multi-GPU platforms. The core of MGG is its novel dynamic software pipeline to facilitate fine-grained computation-communication overlapping within a GPU kernel. Specifically, MGG introduces GNN-tailored pipeline construction and GPU-aware pipeline mapping to facilitate workload balancing and operation overlapping. MGG also incorporates an intelligent runtime design with analytical modeling and optimization heuristics to dynamically improve the execution performance. Extensive evaluation reveals that MGG outperforms state-of-the-art full-graph GNN systems across various settings: on average 4.41×, 4.81×, and 10.83× faster than DGL, MGG-UVM, and ROC, respectively.
more » « less
Full Text Available
Spy in the GPU-box: Covert and Side Channel Attacks on Multi-GPU Systems

https://doi.org/10.1145/3579371.3589080

Dutta, Sankha Baran; Naghibijouybari, Hoda; Gupta, Arjun; Abu-Ghazaleh, Nael; Marquez, Andres; Barker, Kevin (June 2023, International Symposium on Computer Architecture (ISCA))

Full Text Available
Direction-Optimizing Label Propagation Framework for Structure Detection in Graphs: Design, Implementation, and Experimental Analysis

https://doi.org/10.1145/3564593

Liu, Xu T.; Lumsdaine, Andrew; Halappanavar, Mahantesh; Barker, Kevin; Gebremedhin, Assefaw H. (October 2022, ACM Journal of Experimental Algorithmics)

Label Propagation is not only a well-known machine learning algorithm for classification, but it is also an effective method for discovering communities and connected components in networks. We propose a new Direction-Optimizing Label Propagation Algorithm (DOLPA) framework that enhances the performance of the standard Label Propagation Algorithm (LPA), increases its scalability, and extends its versatility and application scope. As a central feature, the DOLPA framework relies on the use of frontiers and alternates between label push and label pull operations to attain high performance. It is formulated in such a way that the same basic algorithm can be used for finding communities or connected components in graphs by only changing the objective function used. Additionally, DOLPA has parameters for tuning the processing order of vertices in a graph to reduce the number of edges visited and improve the quality of solution obtained. We present the design and implementation of the enhanced algorithm as well as our shared-memory parallelization of it using OpenMP. We also present an extensive experimental evaluation of our implementations using the LFR benchmark and real-world networks drawn from various domains. Compared with an implementation of LPA for community detection available in a widely used network analysis software, we achieve at most five times the F-Score while maintaining similar runtime for graphs with overlapping communities. We also compare DOLPA against an implementation of the Louvain method for community detection using the same LFR-graphs and show that DOLPA achieves about three times the F-Score at just 10% of the runtime. For connected component decomposition, our algorithm achieves orders of magnitude speedups over the basic LP-based algorithm on large diameter graphs, up to 13.2 × speedup over the Shiloach-Vishkin algorithm, and up to 1.6 × speedup over Afforest on an Intel Xeon processor using 40 threads.
more » « less
Full Text Available
Bit-GraphBLAS: Bit-Level Optimizations of Matrix-Centric Graph Processing on GPU

https://doi.org/10.1109/IPDPS53621.2022.00056

Chen, Jou-An; Sung, Hsin-Hsuan; Shen, Xipeng; Tallent, Nathan; Barker, Kevin; Li, Ang (May 2022, 36th IEEE International Parallel & Distributed Processing Symposium)

Full Text Available
Leaky Buddies: Cross-Component Covert Channels on Integrated CPU-GPU Systems

https://doi.org/10.1109/ISCA52012.2021.00080

Dutta, Sankha Baran; Naghibijouybari, Hoda; Abu-Ghazaleh, Nael; Marquez, Andres; Barker, Kevin (June 2021, International Symposium on Computer Architecture (ISCA))

Full Text Available
Direction-optimizing label propagation and its application to community detection

https://doi.org/10.1145/3387902.3392634

Liu, Xu T.; Halappanavar, Mahantesh; Barker, Kevin J.; Lumsdaine, Andrew; Gebremedhin, Assefaw H. (May 2020, Computing Frontiers)

Full Text Available
A parallel sparse tensor benchmark suite on CPUs and GPUs

https://doi.org/10.1145/3332466.3374513

Li, Jiajia; Lakshminarasimhan, Mahesh; Wu, Xiaolong; Li, Ang; Olschanowsky, Catherine; Barker, Kevin (February 2020, Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming)

Tensor computations present significant performance challenges that impact a wide spectrum of applications. Efforts on improving the performance of tensor computations include exploring data layout, execution scheduling, and parallelism in common tensor kernels. This work presents a benchmark suite for arbitrary-order sparse tensor kernels using state-of-the-art tensor formats: coordinate (COO) and hierarchical coordinate (HiCOO). It demonstrates a set of reference tensor kernel implementations and some observations on Intel CPUs and NVIDIA GPUs.
more » « less
Full Text Available
BSTC: a novel binarized-soft-tensor-core design for accelerating bit-based approximated neural nets

https://doi.org/10.1145/3295500.3356169

Li, Ang; Geng, Tong; Wang, Tianqi; Herbordt, Martin; Song, Shuaiwen Leon; Barker, Kevin (November 2019, SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis)

Full Text Available
Distributed Direction-Optimizing Label Propagation for Community Detection

https://doi.org/10.1109/HPEC.2019.8916215

Liu, Xu; Firoz, Jesun Sahariar; Zalewski, Marcin; Halappanavar, Mahantesh; Barker, Kevin J.; Lumsdaine, Andrew; Gebremedhin, Assefaw H. (September 2019, 2019 IEEE High Performance Extreme Computing Conference (HPEC))

Full Text Available

Search for: All records